# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Dec 01 21:43:12 2021"
I have used R and RStudio quite a lot in my work. I'm familiar with Git and GitHub, but have not become fluent with them yet – in fact, using them has always felt a bit cumbersome. I hope this course will help me get better at using Git/GitHub.
I heard about the course from my supervisor and department mailing list.
Here is a link to my GitHub page for this project: https://github.com/jpkos/IODS-project
In this exercise, we use linear regression to evaluate how different learning approaches affect students' performance in a statistics course.
We will start off by loading the libraries:
library(dplyr)
library(GGally)
library(ggplot2)
In the data wrangling section, we created a csv file with the data. Let’s load that.
data <- read.csv("data/learning2014_csv.csv")
Pick only the combined columns and the columns with participant info
cols <- c("Age", "Attitude", "Points", "gender", "deep", "surf", "stra")
data <- data[cols]
Data overview (dimensions [rows and cols] and structure [variables and their datatypes])
dim(data)
## [1] 166 7
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
The dataset is about how students' learning approaches affect their exam points in a statistics course. The data contains combined variables that describe the students' learning approaches (deep, surface, strategic) and individual variables with participant information (age, gender, attitude towards statistics, exam points).
Summary of the variables:
summary(data)
## Age Attitude Points gender
## Min. :17.00 Min. :14.00 Min. : 7.00 Length:166
## 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 Class :character
## Median :22.00 Median :32.00 Median :23.00 Mode :character
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep surf stra
## Min. :1.583 Min. :1.583 Min. :1.250
## 1st Qu.:3.333 1st Qu.:2.417 1st Qu.:2.625
## Median :3.667 Median :2.833 Median :3.188
## Mean :3.680 Mean :2.787 Mean :3.121
## 3rd Qu.:4.083 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.917 Max. :4.333 Max. :5.000
Plot variable pairs to visualize correlations
p <- ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
Some notes about the variables:
Let’s fit a model with three variables (strategic approach, surface approach, attitude) and the exam points as the target variable
model <- lm(Points ~ stra + surf + Attitude, data=data)
summary(model)
##
## Call:
## lm(formula = Points ~ stra + surf + Attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Attitude has a significant effect; the others do not. Let's fit another model, this time with Attitude only:
model2 <- lm(Points ~ Attitude, data=data)
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The result indicates that students' attitude has a strong association with their exam points: a one-point increase in the Attitude score increases the predicted exam points by about 0.35 (t = 6.214, p < 0.001).
The multiple R-squared measures how well the model fits the data; the adjusted R-squared additionally penalizes for the number of fitted variables. The multiple R-squared for the first model was 0.2074, which can be interpreted to mean that the model explains roughly 20.7% of the variance in the exam points. When we fit the model again with only Attitude as the explanatory variable, the R-squared is 0.1906 (~19.1%), meaning that Attitude alone explains most of the variance captured by the first model.
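As a quick check, the R-squared values quoted above can be read directly from the fitted model objects (a small sketch using the `model` and `model2` objects fitted earlier):

```r
# Multiple and adjusted R-squared for both models
summary(model)$r.squared       # ~0.2074 for the three-variable model
summary(model2)$r.squared      # ~0.1906 for the Attitude-only model
summary(model2)$adj.r.squared  # adjusted version penalizes for model size
```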
Diagnostic plots:
plot(model2, c(1,2,5))
In the residuals vs. fitted plot, we can see that the residuals may be getting smaller at larger fitted values, although the number of data points there is also smaller. Overall there does not seem to be serious heteroscedasticity (the residuals should not depend on the fitted values, i.e. they should be randomly distributed).
The QQ-plot shows some deviation from normality at the extreme values; however, the deviations are small and probably do not violate the normality assumption (the errors should be normally distributed).
The residuals vs leverage plot shows that while there are a few points that are kind of outliers, they most likely do not seriously affect the model results.
Finally, let’s plot points vs attitude.
plot(data$Attitude, data$Points)
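The same relationship could also be drawn with ggplot2, with the fitted regression line on top (a sketch; `geom_smooth(method = "lm")` fits the same simple regression as `model2`):

```r
ggplot(data, aes(x = Attitude, y = Points)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")  # adds the Points ~ Attitude fit with a confidence band
```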
Let’s start by loading the necessary libraries
library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
Load the data we processed in the data wrangling section (actually, I’m using the ready made dataset)
data <- read.csv("data/student/alc.csv")
Print a summary of the data structure
str(data)
## 'data.frame': 370 obs. of 51 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr "yes" "no" "no" "no" ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr "yes" "no" "yes" "yes" ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
The data describes each subject's school performance together with various study-related and personal information such as absences (from school), failures (at school), age, family size, education level, etc. G1, G2 and G3 are the grades from three different periods for the course subject in question (Portuguese or math). Alcohol consumption is captured by the binary variable high_use (high/low usage).
We will analyse how 4 different variables affect students' alcohol consumption. The dependent variable is high alcohol use (high_use), with the following independent variables and hypotheses:
First, let’s plot the distributions of these variables:
cols <- c("sex", "failures", "absences", "romantic", "high_use")
gather(data[cols]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar()
Fit a model with these variables:
model <- glm(high_use ~ absences + failures + sex + romantic, data=data, family="binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ absences + failures + sex + romantic,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1132 -0.8415 -0.5849 1.0329 2.1106
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86714 0.24303 -7.683 1.56e-14 ***
## absences 0.09406 0.02323 4.049 5.15e-05 ***
## failures 0.60568 0.20812 2.910 0.00361 **
## sexM 0.98393 0.24789 3.969 7.21e-05 ***
## romanticyes -0.24606 0.26635 -0.924 0.35558
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 406.13 on 365 degrees of freedom
## AIC: 416.13
##
## Number of Fisher Scoring iterations: 4
According to the results:
Hypotheses 1 - 3 were supported by the data. For hypothesis 4, the effect was negative as predicted, but the effect is not statistically significant.
The confidence intervals of these variables are:
confint(model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.36326577 -1.4080742
## absences 0.05079132 0.1421351
## failures 0.20530644 1.0277095
## sexM 0.50384493 1.4776693
## romanticyes -0.77779991 0.2691722
We can also interpret the coefficients as odds ratios by running them through the exponent function
coef(model) %>% exp
## (Intercept) absences failures sexM romanticyes
## 0.1545656 1.0986256 1.8324980 2.6749408 0.7818748
Now the coefficients can be read as odds ratios: for example, being male multiplies the odds of high alcohol use by about 2.675, and so on.
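The odds ratios and their confidence intervals can also be collected into a single table (a sketch using the `model` object fitted above; `confint` again prints the profiling message):

```r
# Odds ratios with 2.5% / 97.5% profile-likelihood confidence intervals
OR <- exp(cbind(OR = coef(model), confint(model)))
round(OR, 3)
```

If the interval for an odds ratio excludes 1, the corresponding coefficient is significant at roughly the 5% level, matching the p-values in the model summary.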
Let’s compare predicted vs real results with a 2x2 table
predictions <- as.factor(predict(model, newdata = data, type = "response")>0.5)
true = as.factor(data$high_use)
confusionMatrix(data=predictions, reference=true)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 247 77
## TRUE 12 34
##
## Accuracy : 0.7595
## 95% CI : (0.7126, 0.8021)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.006538
##
## Kappa : 0.3122
##
## Mcnemar's Test P-Value : 1.169e-11
##
## Sensitivity : 0.9537
## Specificity : 0.3063
## Pos Pred Value : 0.7623
## Neg Pred Value : 0.7391
## Prevalence : 0.7000
## Detection Rate : 0.6676
## Detection Prevalence : 0.8757
## Balanced Accuracy : 0.6300
##
## 'Positive' Class : FALSE
##
Total number of inaccurate predictions: 77 + 12 = 89. The total number of observations was 370, so 89/370 = 0.2405, i.e. about 24% of the data was predicted with the wrong label.
To see if this is a good result, we can consider what could be achieved by random chance. If we just flipped coins, we would get about 50% correct, and the logistic regression model is clearly better than that. However, the proportion of non-high users is much larger than the proportion of high users: 259 versus 111. Thus, if we had a model that predicted everyone as a non-high user, we would get 70% correct. Therefore, the model is actually not that much better than the trivial model "no one is a high user".
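This comparison can be computed directly (a sketch using the `predictions` and `true` objects defined above; the baseline simply predicts the majority class for everyone):

```r
model_acc    <- mean(predictions == true)              # accuracy of the fitted model
baseline_acc <- max(prop.table(table(data$high_use)))  # always predict "not high use"
c(model = model_acc, baseline = baseline_acc)
```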
As usual, let's start by loading the libraries. Since I'm not sure which ones I'll end up using, I will load those we have used so far:
library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
library(MASS)
Load the Boston dataset from the MASS package
data("Boston")
Explore the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The dataset describes housing values in the Boston area and includes several variables that could potentially affect them. These include: crime rate (crim), average number of rooms per dwelling (rm), nitrogen oxide concentration (nox), property tax rate (tax), black (related to the proportion of black residents), and so on.
Let us next create graphical plots of the dataset to visualize correlations:
ggpairs(Boston, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
The strongest correlations with median housing value are:
Each of the 4 strongest correlations makes sense intuitively.
Show summary of the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Some notes about this summary:
Scale the dataset
Boston <- data.frame(scale(Boston))
summary(Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Now each variable has mean 0 and standard deviation 1. Next, we will categorize the crime rate variable into quartiles:
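We can verify the standardization numerically (a small sketch; `scale` subtracts each column's mean and divides by its standard deviation):

```r
round(colMeans(Boston), 10)  # all ~0 after centering
apply(Boston, 2, sd)         # all 1 after scaling
```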
quants <- quantile(Boston$crim)
crimerate <- cut(Boston$crim, breaks=quants, include.lowest=TRUE)
Drop the continuous crime rate from the dataset and add the categorical version:
Boston2 <- Boston[,!(names(Boston) %in% c("crim"))]
Boston2$crim_cat <- crimerate
Divide into train/test datasets with an 80/20 split:
train_ind <- sample(nrow(Boston2), nrow(Boston2)*0.8)
Boston2.train <- Boston2[train_ind,]
Boston2.test <- Boston2[-train_ind,]
Fit Linear Discriminant Analysis (LDA):
lda.model <- lda(crim_cat ~., data=Boston2.train)
Plot bi-plot with arrows. Let’s use the example function provided on datacamp:
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
Plot with the fitted model
cats <- as.numeric(Boston2.train$crim_cat)
plot(lda.model, dimen=2, col=cats, pch=cats)
lda.arrows(lda.model)
Save the categorical crime data from the test set and remove it:
cats.test <- Boston2.test$crim_cat
Boston2.test <- dplyr::select(Boston2.test, -crim_cat)
Now predict the categories for the test data and cross-tabulate
cats.pred <- predict(lda.model, newdata = Boston2.test)
table(correct = cats.test, predicted = cats.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 19 10 3 0
## (-0.411,-0.39] 3 17 6 0
## (-0.39,0.00739] 0 5 17 1
## (0.00739,9.92] 0 0 0 21
The best results were achieved for the largest class (crime rate in (0.00739, 9.92]). The second lowest category ((-0.411, -0.39]) had the most wrong predictions. Percentage-wise, however, the second largest category had more wrong predictions (10 correct, 9 wrong, i.e. 47.37% of the predictions were wrong). The second lowest category had the most false positives, and the lowest category the most false negatives.
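The overall test-set accuracy can be computed from the cross-tabulation (a sketch using the objects above; the exact value varies between runs because the train/test split is random):

```r
tab <- table(correct = cats.test, predicted = cats.pred$class)
sum(diag(tab)) / sum(tab)  # proportion of test observations classified correctly
```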
Next we will explore the data using clustering methods, namely the k-means clustering approach. First, fit the model using k=4:
km.model <- kmeans(Boston, centers=4)
Plot results
pairs(Boston, col=km.model$cluster)
We can choose the number of clusters by examining how the total within-cluster sum of squares (TWCSS) behaves as a function of k: the TWCSS always decreases as k grows, so we look for the "elbow" where the decrease levels off. Using the code from DataCamp:
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
There is a sharp drop at k=2, and after that the drops are more gradual. Let's refit k-means with k=2:
km.model2 <- kmeans(Boston, centers=2)
Plot
pairs(Boston, col=km.model2$cluster)
It seems that 2 is indeed a good number of clusters; visual inspection shows no clear outliers (i.e. no possible third cluster that sticks out).
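To characterize the two clusters, we can also look at their sizes and centers (a sketch using the k=2 model fitted above; the centers are expressed in the standardized variable space):

```r
km.model2$size               # number of observations in each cluster
round(km.model2$centers, 2)  # cluster means for each (scaled) variable
```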
As usual, let’s start with loading the libraries:
library(dplyr)
library(GGally)
library(ggplot2)
library(tidyr)
library(caret)
library(MASS)
library(reshape2)
Load the data we processed in the data wrangling section (actually, I’m using the ready made dataset)
data <- read.csv("data/human/human2.txt")
Print a summary of the data structure
str(data)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The dataset includes human development metrics, such as life expectancy, GNI and education level, for different countries. Definitions for some metrics:
Plot variable pairs to visualize correlations
p <- ggpairs(data, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The strongest correlations can be seen in
Scale the data
data.scaled <- data.frame(scale(data))
Now we shall run the principal component analysis (PCA) on the data. First with the non-scaled data.
pca.original <- prcomp(data)
Draw a biplot
biplot(pca.original, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Then again, this time for the scaled data
pca.scaled <- prcomp(data.scaled)
and plot
biplot(pca.scaled, choices = 1:2)
The results look very different. This is because the different metrics are measured in different units. The units are somewhat arbitrary (you can measure maternal mortality per 100k people or per 1 million people), so the raw variables cannot be compared directly (apples and oranges), and unscaled PCA is dominated by the variable with the largest variance (here GNI). When we standardize the data, every variable has mean 0 and standard deviation 1, after which the variables contribute on a comparable scale.
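The difference also shows up in the proportion of variance explained by each principal component (a sketch using the two PCA objects fitted above):

```r
# Proportion of total variance explained by each PC
round(pca.original$sdev^2 / sum(pca.original$sdev^2), 3)  # one PC dominates without scaling
round(pca.scaled$sdev^2 / sum(pca.scaled$sdev^2), 3)      # variance spreads across PCs
```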
When we look at the principal components in the last figure, we can see some patterns:

- Moving to the left on the first principal component and up on the second, we find Nordic countries that are highly developed in many areas.
- Moving down along PC2, we find countries that are wealthy but may have issues with gender equality (women's parliamentary representation and labor force participation).
- Moving to the right, we find countries that are less wealthy and have problems with maternal mortality and girls giving birth at a young age.
- The lower right-hand corner has countries that are poor, have gender inequality, and face the abovementioned problems connected to childbirth.
Load the tea dataset:
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
data(tea)
Summary of the data:
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
Structure:
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
Dimensions:
dim(tea)
## [1] 300 36
The dataset has 36 columns and 300 rows. The columns (variables) are related to people’s tea drinking habits. Most of the variables are categorical and most of them have two levels such as sugar use (yes/no).
Do multiple correspondence analysis and visualize
MCA(X=tea[, -which(names(tea) %in% c('age'))], method="indicator")
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 35 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "intermediate results"
## 12 "$call$marge.col" "weights of columns"
## 13 "$call$marge.li" "weights of rows"
In the plot, we see the different variable levels drawn as points on two axes. The distance between categories tells us how similar they are: levels that appear close together tend to occur together. Two of the lowest points in the graph are the age group 15-24 and student, meaning that these two levels are similar (young people tend to be students). We can also see some less obvious patterns:
- Black, green and Earl Grey teas are in completely different parts of the plot. This indicates that subjects' characteristics (how they drink tea and their perception of it) are associated with the type of tea they drink. For example, older people prefer black tea and younger people prefer Earl Grey. Non-workers are closer to black tea, while workers are closer to green tea.
- Older people prefer unpackaged tea; younger people use packaged tea (tea bags) more often.
- Workmen prefer cheap tea; older people prefer upscale tea.
- Young people and/or students are clearly distinct from the other groups.
- People who drink tea 1 to 2 times per week are more likely to use sugar than those who drink 3 to 6 times per week. Older people tend not to use sugar, while workers are more likely to use it.
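The share of variation captured by the first MCA dimensions can be checked from the eigenvalues. This is a sketch that assumes the MCA result is stored in an object (above, `MCA` was called without assignment); `graph = FALSE` suppresses the plots on the repeat run:

```r
mca <- MCA(X = tea[, -which(names(tea) %in% c("age"))],
           method = "indicator", graph = FALSE)
head(mca$eig)  # eigenvalue, percentage of variance, cumulative percentage per dimension
```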
Load libraries
library(reshape2)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
library(lmerTest)
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
We will start by analysing the RATS dataset using the analysis presented in chapter 8 from the MABS book.
Load data
catcols.RATS <- c("ID", "Group", "Time")
time.order <- c("WD1", "WD8", "WD15", "WD22", "WD29", "WD36", "WD43","WD44", "WD50", "WD57", "WD64")
RATS <- read.csv("data/chapter5/RATS.csv")
RATS[catcols.RATS] <- lapply(RATS[catcols.RATS], factor)
RATS <- RATS[-1]
RATS$Time <- factor(RATS$Time, levels=time.order)
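The explicit `levels=` argument matters here because by default R sorts factor levels alphabetically, which would put "WD15" before "WD8" in the plots. A quick check on a toy vector:

```r
x <- c("WD8", "WD1", "WD15")
# Default: alphabetical order puts "WD15" before "WD8"
levels(factor(x))
# Explicit levels restore the chronological order
levels(factor(x, levels = c("WD1", "WD8", "WD15")))
```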
Print summary to see what type of data we have:
summary(RATS)
## ID Group Time value
## 1 : 11 1:88 WD1 :16 Min. :225.0
## 2 : 11 2:44 WD8 :16 1st Qu.:267.0
## 3 : 11 3:44 WD15 :16 Median :344.5
## 4 : 11 WD22 :16 Mean :384.5
## 5 : 11 WD29 :16 3rd Qu.:511.2
## 6 : 11 WD36 :16 Max. :628.0
## (Other):110 (Other):80
We have four columns: ID, Group, Time and value. ID, Group and Time are categorical (we converted them to factors above), while value is numeric.
Next, visualize data:
ggplot(data=RATS, aes(x=Time, y=value, col=ID)) + geom_point()
Looks like the IDs split into two groups: those below 300 and those above 400. The ones above 400 show a slight increase over time, while the below-300 group stays more constant. The different IDs also seem to be correlated: when one ID increases, the others tend to increase too, and when one decreases, the others do the same, with no clear delay. Between WD29 and WD44 there was little to no change, after which the increase continued. There are also individual differences between IDs; some increase more than others. We can also plot by group:
ggplot(data=RATS, aes(x=Time, y=value, col=Group)) + geom_point()
Now we see more patterns. The IDs below 300 all belong to the same group. Group 2 has the second largest values, but one subject (ID 12) is an outlier. It's possible that this ID was mislabeled.
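One way to confirm the suspected outlier is to compare each ID's mean value with its group mean; a large gap flags a candidate. A hedged base-R sketch on toy data (column names match RATS, the numbers are made up):

```r
# Toy long-format data shaped like RATS
toy <- data.frame(ID    = factor(rep(1:4, each = 2)),
                  Group = factor(rep(c(1, 1, 2, 2), each = 2)),
                  value = c(250, 255, 260, 262, 440, 455, 600, 605))
# Mean per ID and per Group; IDs far from their group mean stand out
id.means    <- tapply(toy$value, toy$ID, mean)
group.means <- tapply(toy$value, toy$Group, mean)
id.means
group.means
```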
Load data
catcols.BPRS <- c("treatment", "subject", "week")
week.order <- c("week0","week1","week2","week3","week4","week5","week6","week7","week8")
BPRS <- read.csv("data/chapter5/BPRS.csv")
BPRS[catcols.BPRS] <- lapply(BPRS[catcols.BPRS], factor)
BPRS <- BPRS[-1]
BPRS$week <- factor(BPRS$week, levels=week.order)
Visualize. First plot by subject:
ggplot(data=BPRS, aes(x=week, y=value, col=subject)) + geom_point()
There is quite a lot of variation between and within participants, so it's hard to see any interesting patterns in this data. One thing we do note is that the values seem to decrease until week 5 or 6, after which the decrease stops and some participants even show an increase.
Then let's look at the values by treatment, which will probably be more interesting:
ggplot(data=BPRS, aes(x=week, y=value, col=treatment)) + geom_point()
Even with the treatment grouping, no clear pattern emerges; the two treatments overlap heavily. Because we are considering the effects of different treatments, it might also be fruitful to compare data at the beginning and at the end of the experiment. Let's draw boxplots, first by subject:
ggplot(data=subset(BPRS, BPRS$week %in% c("week0", "week8")), aes(x=week, y=value, col=subject)) + geom_boxplot()
From this it’s again hard to see any clear trend, except that the values seem to decrease by week 8 for almost all subjects.
Then plot by treatment:
ggplot(data=subset(BPRS, BPRS$week %in% c("week0", "week8")), aes(x=week, y=value, col=treatment)) + geom_boxplot()
Now the decrease over time is even clearer. However, there seems to be no difference between the treatments.
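The `subset()` calls above keep only the first and last weeks; the `%in%` operator is vectorized set membership, returning TRUE for elements found in the right-hand vector:

```r
weeks <- c("week0", "week3", "week8")
keep <- weeks %in% c("week0", "week8")
keep          # TRUE FALSE TRUE
weeks[keep]   # "week0" "week8"
```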
Next, fit a linear mixed effects model with treatment, week and their interaction as fixed effects and a random intercept for subject. The model is defined as follows:
lmer.m1 <- lmer(value ~ treatment*week + (1|subject), data=subset(BPRS, BPRS$week %in% c("week0", "week8")))
summary(lmer.m1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: value ~ treatment * week + (1 | subject)
## Data: subset(BPRS, BPRS$week %in% c("week0", "week8"))
##
## REML criterion at convergence: 618.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9188 -0.6116 -0.1409 0.5151 2.8695
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 27.77 5.27
## Residual 149.01 12.21
## Number of obs: 80, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.000 2.973 70.762 15.809 < 2e-16 ***
## treatment2 2.000 3.860 57.000 0.518 0.606
## weekweek8 -17.700 3.860 57.000 -4.585 2.53e-05 ***
## treatment2:weekweek8 2.250 5.459 57.000 0.412 0.682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trtmn2 wekwk8
## treatment2 -0.649
## weekweek8 -0.649 0.500
## trtmnt2:wk8 0.459 -0.707 -0.707
At baseline, treatment 2 was 2.0 points higher than treatment 1, but the difference was not significant (t = 0.518, p = 0.606). The values decreased over time: by week 8, the treatment 1 value was 17.70 points lower than at baseline (t = -4.585, p < 0.001). The treatment 2 value at week 8 was 2.250 points higher relative to treatment 1, which is almost the same difference as at baseline, and the interaction was not significant (t = 0.412, p = 0.682). Therefore the treatment did not seem to have any effect.
The standard deviation of the random intercept for subject was 5.27, so there were some differences between subjects.
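From the variance components we can also compute the intraclass correlation (ICC), i.e. the share of the total variance attributable to between-subject differences (the numbers are copied from the model summary above):

```r
var.subject  <- 27.77   # random intercept variance (subject)
var.residual <- 149.01  # residual variance
icc <- var.subject / (var.subject + var.residual)
round(icc, 3)  # about 0.157: roughly 16% of the variance is between subjects
```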
Based on this analysis, the treatment did not have any effect. This is in line with what we saw in the graphical data exploration. Most of the decrease in values comes from time.
We can print the random coefficients:
coef(lmer.m1)
## $subject
## (Intercept) treatment2 weekweek8 treatment2:weekweek8
## 1 50.85972 2 -17.7 2.25
## 2 44.56031 2 -17.7 2.25
## 3 48.72433 2 -17.7 2.25
## 4 47.22955 2 -17.7 2.25
## 5 51.18003 2 -17.7 2.25
## 6 43.27908 2 -17.7 2.25
## 7 52.03419 2 -17.7 2.25
## 8 49.47172 2 -17.7 2.25
## 9 42.95877 2 -17.7 2.25
## 10 50.64618 2 -17.7 2.25
## 11 53.42219 2 -17.7 2.25
## 12 45.73478 2 -17.7 2.25
## 13 45.94832 2 -17.7 2.25
## 14 44.02647 2 -17.7 2.25
## 15 49.25818 2 -17.7 2.25
## 16 45.73478 2 -17.7 2.25
## 17 45.84155 2 -17.7 2.25
## 18 42.53169 2 -17.7 2.25
## 19 42.74523 2 -17.7 2.25
## 20 43.81293 2 -17.7 2.25
##
## attr(,"class")
## [1] "coef.mer"
This shows each participant's intercept (the fixed intercept from the model summary plus the subject's random effect); the other coefficients are identical across subjects because only the intercept was allowed to vary.
Perhaps the treatment has a different effect on each subject? After all, we saw in the plots that the subjects differed quite a lot. Furthermore, let's add the rest of the weeks to the analysis by converting week to a numeric variable:
BPRS$week_num <- as.numeric(substr(BPRS$week, 5, 8))
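`substr(week, 5, 8)` strips the fixed-width "week" prefix, leaving only the digits, which `as.numeric` then converts. A quick check:

```r
weeks <- c("week0", "week5", "week8")
week_num <- as.numeric(substr(weeks, 5, 8))
week_num  # 0 5 8
```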
We can then add a random slope for each subject:
lmer.m2 <- lmer(value ~ treatment*week_num + (1+week_num|subject), data=BPRS)
summary(lmer.m2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: value ~ treatment * week_num + (1 + week_num | subject)
## Data: BPRS
##
## REML criterion at convergence: 2724.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0382 -0.6240 -0.0748 0.5322 3.9144
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 69.272 8.323
## week_num 1.057 1.028 -0.52
## Residual 97.076 9.853
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 47.8856 2.3016 27.7107 20.805 < 2e-16 ***
## treatment2 -2.2911 1.9150 318.0005 -1.196 0.2324
## week_num -2.6283 0.3657 38.6133 -7.187 1.26e-08 ***
## treatment2:week_num 0.7158 0.4022 318.0005 1.780 0.0761 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trtmn2 wek_nm
## treatment2 -0.416
## week_num -0.649 0.462
## trtmnt2:wk_ 0.350 -0.840 -0.550
This did not change the results markedly. Treatment 2 had a 2.29 points lower value, but the difference was not significant (t = -1.196, p = 0.232). The value decreased by 2.63 points each week (t = -7.187, p < 0.001). Treatment still does not seem to have any effect, although the treatment-by-week interaction approached significance (t = 1.780, p = 0.076).
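Using the fixed-effect estimates above, we can sketch the model-implied mean trajectories for the two treatments (estimates copied from the summary; the gap at week 8 is small compared to the residual SD of 9.85):

```r
# Fixed-effect estimates from the lmer.m2 summary above
b0 <- 47.8856; b.trt <- -2.2911; b.week <- -2.6283; b.int <- 0.7158
week <- 0:8
mean.trt1 <- b0 + b.week * week                       # treatment 1 trajectory
mean.trt2 <- (b0 + b.trt) + (b.week + b.int) * week   # treatment 2 trajectory
round(mean.trt1[9] - mean.trt2[9], 2)  # gap at week 8 (treatment 1 minus 2)
```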
Here are the coefficients:
coef(lmer.m2)
## $subject
## (Intercept) treatment2 week_num treatment2:week_num
## 1 49.07248 -2.291111 -1.223339 0.7158333
## 2 47.05039 -2.291111 -3.280383 0.7158333
## 3 47.72686 -2.291111 -3.211386 0.7158333
## 4 49.83813 -2.291111 -2.437827 0.7158333
## 5 66.68207 -2.291111 -4.342611 0.7158333
## 6 42.56109 -2.291111 -2.550753 0.7158333
## 7 54.71328 -2.291111 -3.500714 0.7158333
## 8 47.79551 -2.291111 -1.530085 0.7158333
## 9 42.09978 -2.291111 -2.985474 0.7158333
## 10 53.35978 -2.291111 -2.552101 0.7158333
## 11 59.93477 -2.291111 -1.927352 0.7158333
## 12 48.45376 -2.291111 -2.880199 0.7158333
## 13 50.43280 -2.291111 -3.122746 0.7158333
## 14 40.84752 -2.291111 -2.186925 0.7158333
## 15 55.19374 -2.291111 -3.407935 0.7158333
## 16 44.45802 -2.291111 -2.947707 0.7158333
## 17 43.78081 -2.291111 -1.592874 0.7158333
## 18 36.66185 -2.291111 -1.757613 0.7158333
## 19 38.07845 -2.291111 -2.472162 0.7158333
## 20 38.97001 -2.291111 -2.656482 0.7158333
##
## attr(,"class")
## [1] "coef.mer"
We can see that there was quite a lot of variation in both the subject-specific intercepts and the subject-specific weekly slopes.